Vision-Language Models are Strong Noisy Label Detectors
Recent research on fine-tuning vision-language models has demonstrated impressive performance on various downstream tasks. However, accurately labeled data are often difficult to obtain in real-world applications, which poses a significant obstacle to fine-tuning. To address this challenge, this paper presents DeFT, a Denoising Fine-Tuning framework for adapting vision-language models.
Prompt-based Consistent Video Colorization
Dani, Silvia, Uricchio, Tiberio, Seidenari, Lorenzo
Existing video colorization methods struggle with temporal flickering or demand extensive manual input. We propose a novel approach that automates high-fidelity video colorization using rich semantic guidance derived from language and segmentation. We employ a language-conditioned diffusion model to colorize grayscale frames. Guidance is provided via automatically generated object masks and textual prompts; our primary automatic method uses a generic prompt, achieving state-of-the-art results without specific color input. Temporal stability is achieved by warping color information from previous frames using optical flow (RAFT); a correction step detects and fixes inconsistencies introduced by warping. Evaluations on standard benchmarks (DAVIS30, VIDEVO20) show that our method achieves state-of-the-art performance in colorization accuracy (PSNR) and visual realism (Colorfulness, CDC), demonstrating the efficacy of automated prompt-based guidance for consistent video colorization.
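The flow-based temporal-stability step described above can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's implementation: `warp_colors` pulls the previous frame's color (ab) channels into the current frame along a dense flow field with nearest-neighbour sampling, and `correct_inconsistencies` approximates the correction step by falling back to freshly generated colors wherever the warped luminance disagrees with the current frame. The function names, the luminance-mismatch criterion, and the `thresh` parameter are illustrative assumptions.

```python
import numpy as np

def warp_colors(prev_ab, flow):
    """Warp the previous frame's color (ab) channels into the current
    frame using a dense optical-flow field (nearest-neighbour sampling).

    prev_ab: (H, W, 2) color channels of the previous frame.
    flow:    (H, W, 2) backward flow mapping current pixels to previous ones.
    """
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Sample coordinates in the previous frame, clipped to the image bounds.
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return prev_ab[src_y, src_x]

def correct_inconsistencies(warped_ab, prev_l_warped, cur_l, recolored_ab, thresh=0.1):
    """Keep warped colors only where the warped luminance matches the
    current frame's luminance; elsewhere fall back to freshly generated colors."""
    mismatch = np.abs(prev_l_warped - cur_l) > thresh  # (H, W) boolean mask
    out = warped_ab.copy()
    out[mismatch] = recolored_ab[mismatch]
    return out
```

In practice the flow field would come from a pretrained estimator such as RAFT, and the fallback colors from re-running the diffusion colorizer on the mismatched regions; here both are simply passed in as arrays.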
Unified Text-Image-to-Video Generation: A Training-Free Approach to Flexible Visual Conditioning
Lai, Bolin, Lee, Sangmin, Cao, Xu, Li, Xiang, Rehg, James M.
Text-image-to-video (TI2V) generation is a critical problem for controllable video generation using both semantic and visual conditions. Most existing methods add visual conditions to text-to-video (T2V) foundation models by fine-tuning, which is costly in resources and limited to a few predefined conditioning settings. To tackle these constraints, we introduce a unified formulation for TI2V generation with flexible visual conditioning. Furthermore, we propose an innovative training-free approach, dubbed FlexTI2V, that can condition T2V foundation models on an arbitrary number of images at arbitrary positions. Specifically, we first invert the condition images into noisy representations in a latent space. Then, in the denoising process of T2V models, our method uses a novel random patch swapping strategy to incorporate visual features into video representations through local image patches. To balance creativity and fidelity, we use a dynamic control mechanism to adjust the strength of visual conditioning for each video frame. Extensive experiments validate that our method surpasses previous training-free image conditioning methods by a notable margin. Our method also generalizes to both UNet-based and transformer-based architectures.
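The random patch-swapping idea can be sketched concretely. The following is a minimal NumPy illustration under stated assumptions, not the paper's actual implementation: given the noisy video latents and an inverted condition-image latent, a fraction of local patches in the conditioned frame is replaced by the corresponding patches from the condition latent. The tensor shapes, the `patch` size, and the `swap_ratio` parameter are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def patch_swap(video_latent, cond_latent, frame_idx, swap_ratio, patch=2):
    """Replace a random subset of local patches in one frame's noisy latent
    with the corresponding patches from an inverted condition-image latent.

    video_latent: (T, H, W, C) noisy latents for the T video frames.
    cond_latent:  (H, W, C) inverted latent of the condition image.
    swap_ratio:   fraction of patches to swap -- the conditioning strength.
    """
    out = video_latent.copy()
    h, w = cond_latent.shape[:2]
    # Enumerate non-overlapping patch origins on a regular grid.
    origins = [(y, x) for y in range(0, h, patch) for x in range(0, w, patch)]
    n_swap = int(round(swap_ratio * len(origins)))
    for i in rng.permutation(len(origins))[:n_swap]:
        y, x = origins[i]
        out[frame_idx, y:y + patch, x:x + patch] = cond_latent[y:y + patch, x:x + patch]
    return out
```

The dynamic control mechanism mentioned in the abstract would then amount to varying `swap_ratio` per frame and per denoising step, trading fidelity to the condition images against the model's creative freedom.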